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1.  Introduction 

This  paper  considers  the  implementation  of  a  standard  method,  namely  Gaussian  elimination, 
on  a  multiprocessor  based  on  the  hypercube  topology.  Our  objective  is  not  to  present  new  and 
better  algorithms,  but  rather  to  analyze  the  behavior  of  two  different  implementations  of  Gaussian 
elimination  in  a  hypercube  based  architecture.  Suppose  that  the  rows  (or  columns)  of  the  system 
are  distributed  in  some  way  among  the  processors.  At  each  step  j  of  the  first  implementation, 
which  we  call  a  broadcast  algorithm,  the  elimination  row  (or  column)  is  sent  to  all  processors  that 
hold  at  least  one  of  the  rows  i  +  1,  i  +  2, . . . ,  iV.  This  data  movement  is  a  broadcast  type  operation  : 
one  processor  sends  the  same  data  to  all  others  or  to  a  subset  of  all  others.  Once  the  elimination 
row  (or  column  of  multipliers)  has  been  broadcast,  the  arithemtic  corresponding  to  the  t**  step  of 
Gaussian  elimination  is  performed. 

The  second  approach  differs  only  in  its  organization.  Processors  are  no  longer  required  to 
perform  the  step  of  Gaussian  elimination  at  the  same  time  (or  approximately  at  the  same 
time).  A  processor  awaits  for  the  row,  passes  it  to  the  next  processor  as  soon  as  it  arrives  and 
then  proceeds  with  the  arithmetic.  The  next  processor  will  in  turn  pass  the  row  to  a  neighbor 
and  proceed  with  arithmetic.  This  idea  of  pipelining  Gaussian  elimination  is  an  old  idea  which  is 
predominant  in  particular  in  systolic  architectures,  [7,  14,  15]. 

These  two  approaches  are  well-known  and  there  is  an  extensive  literature  on  their  performance 
and  their  implementations  on  various  architectures,  see  for  example  [9,  8,  5,  6,  4,  15,  13,  19,  20] . 

Both  of  them  can  be  implemented  on  hypercubes,  but  the  second  only  requires  a  one-dimensional 
or  a  two-dimensional  grid  of  processors.  The  question  then  arises:  can  we  expect  the  broadcast 
methods,  which  take  advantage  of  the  complex  hypercube  topology,  to  outperform  the  pipelined 
methods?  As  will  be  seen  if  the  matrix  is  mapped  by  stripes,  i.e.,  a  few  rows  or  columns  per  pro¬ 
cessor,  then  the  answer  is  clear-cut:  there  is  no  compelling  reason  for  using  a  broadcast  algorithm. 

Similar  broadcast  and  pipelined  algorithms  can  also  be  derived  for  grid  mappings,  in  which  the 
matrix  is  split  into  squares  blocks  that  are  mapped  to  the  nodes  of  a  processor  grid  embedded  in  a 
hypercube.  In  case  these  grid  mappings  are  used,  and  no  pivoting  is  needed,  then  again  pipelined 
methods  are  superior.  When  pivoting  is  needed  a  solution  of  choice  appears  to  be  a  pairwise 
pivoting  technique,  see  [4,  20,  22]. 

This  paper  will  present  a  few  techniques  based  on  both  ring  and  grid  mappings  and  will  provide 
the  estimated  exection  times  for  each  algorithm.  Numerical  experiments  will  be  proposed  to  verify 
the  accuracy  of  the  model  and  then  a  conclusion  based  on  the  estimated  execution  times  will  be 
drawn. 

2.  Hypercube  topology:  properties  and  assumptions 

Hypercube  multiprocessors  have  been  very  successful  in  the  recent  years  among  academic 
institutions  as  well  as  industrial  research  groups.  There  are  currently  several  comercially  available 
parallel  processors  based  on  the  binary  n-cube  network.  For  example,  we  can  mention  Intel's  iPSC 
with  up  to  128  processors,  Ametek’s  System  14  with  256  processors,  Ncube’s  NCUBE-10  with  1024 
processors,  and  Thinking  Machines’  Connection  Machine  with  64000  processors.  It  is  likely  that 
many  others  will  soon  appear  in  the  market. 

By  definition,  an  n-dimensional  hypercube,  or  binary  n-cube,  consists  of  k  =  2"  nodes  num-  j 

bered  by  n-bit  binary  numbers,  from  0  to  2"  -  1  and  interconnected  so  that  there  is  a  link  between  j 

two  processors  if  and  only  if  their  binary  labels  differ  by  exactly  one  bit.  An  alternative  definition 
which  is  perhaps  more  helpful  in  undestanding  the  nature  of  the  topology,  is  to  define  the  hyercube  Z 

in  a  recursive  fashion: 

;  -  I  / 

•  zero-cube  consists  of  only  one  node;  L— _ _  _ _ 

'\  !  Avj.i jtjii.ty  Codes 


•  zero-cube  consists  of  only  one  node; 


I  j  'd/or 

!  / 


•  To  obtain  an  (n  +  1)—  cube  take  two  identical  n-cubes  and  link  their  corresponding  nodes  in 
a  one-to-one  fashion. 

This  is  illustrated  in  Figure  1.  Thus,  for  the  case  n  =  3,  the  8  nodes  are  simply  at  the  vertices  of  a 
three-dimensional  cube.  In  an  n-cube  each  node  has  n  neighbors  with  which  it  can  communicate. 
For  further  details  see  the  references  (2,  16,  17,  21]  and  the  references  therein. 


Figure  1:  A  Hypercube  of  Dimension  4. 

One  of  the  main  advantages  of  hypercubes,  perhaps  the  most  important  one.  is  that  many  of 
the  classical  topologies  such  as  two-dimensional  or  three-dimensional  meshes,  rings,  trees,  can  be 
embedded  preserving  proximity  in  them  (11,  3,  16,  21,  10,  ij.  An  important  consequence  of  this  is 
that  we  can  design  algorithms  for  ring  or  grid  structures  and  be  able  to  map  them  on  hypercubes. 
This  allows  us  to  choose  the  best  alternative  among  all  of  the  possible  choices.  In  the  present  paper, 
we  will  show  how  to  exploit  this  feature  when  implementing  a  simple  algorithm  such  as  Gaussian 
elimination. 

Some  properties  of  hypercubes  have  been  developed  in  [16]  and  communication  problems  in 
the  hypercube  have  been  examined  in  detail  in  [17].  We  would  like  to  summarize  some  of  the 
properties  that  will  be  useful  in  the  remainder  of  this  paper.  For  details  see  [16|. 

,\n  obvioUs  property  is  that  the  n-cube  is  a  connected  graph  of  diameter  ri.  It  is  usual  to 
numl)er  the  nodes  of  an  f)-cube  by  binary  numbers  from  0  to  2"  -  1  =  A-  -  1  according  to  the 
dehnition  of  an  u-culie.  i.e.,  so  that  two  labels  of  any  two  neighbors  differ  only  in  one  bit.  We  will 
denote  by  the  Hamming  distance  betwt*en  two  binary  numbers  A'  and  V.  i.e..  the  number 

of  I'its  that  differ  between  A’  and  V, 

.\  i)articularly  important  notion  for  the  hypercube  to])ology  is  that  of  (fray  codes.  \  Gray 
code  of  tlimension  ii  is  a  hnite  seciuence  {</,*"*, i/j"*. .  .  .(4"*_i  }  of  2"  binary  numbers  which  represent 
all  n-bit  binary  numbers  and  so  that 

where  the  subscripts  should  Ix’  considered  moilulo  2".  Thus,  i  (fray  code  defines  a  se(|uence  of  all 
th<‘  nodes  of  a  livi>ercut)e  so  that  any  two  successive  nodes  in  the  siMiueiua  art'  neighl'ors:  in  graph 
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theory  terminology  this  is  a  Hamiltonian  circuit.  Therefore  we  can  embed  a  ring  of  2"  nodes  into 
a  hypercube.  Throughout  the  paper  when  we  refer  to  a  ring  we  will  always  mean  a  ring  embedded 
in  the  hypercube  in  this  manner. 

Gray  codes  can  also  be  used  to  map  two-dimensional  or  three-dimensional  grids  into  hyper¬ 
cubes.  For  example  if  we  want  to  embed  an  2"*  by  2”*  two-dimensional  grid  into  an  n-cube  with 
ui  -f-  n2  =  n  it  suffices  to  map  the  node  {i,j)  of  the  grid,  where  0  <  t  <  2”*  —  1,  and  0  <  j  <  2"*  -  1, 
into  the  node  of  the  hypercube  whose  binary  label  is 


where  A  denotes  concatenation.  Observe  that  the  set  of  nodes  obtained  by  fixing  one  coordinate 
and  letting  the  other  vary,  forms  a  subcube.  In  other  words  every  row  or  column  of  processors  of 
the  embedded  processor  grid  forms  a  subcube  of  the  cube.  Again,  in  this  paper  a  grid  will  always 
mean  a  grid  embedded  in  an  n-cube  in  the  manner  just  described. 

We  should  briefly  mention  how  Gray  codes  can  be  generated.  Let  G„  =  {go,92 . 

be  the  n-bit  Gray  code,  with  Gi  =  {0,1}  and  denote  by  the  sequence  obtained  from  G„  by 
reversing  its  order,  and  by  0G„  (resp.  1G„)  the  sequence  obtained  from  G„  by  prefixing  a  zero 
(resp.  a  one)  to  each  element  of  the  sequence.  Then  Gray  codes  of  arbitrary  order  can  be  generated 
by  the  recursion: 

G„+i  =  {0G„,1G^},  n=l,...  (2.1) 

For  example, 

G2  =  {00,01,11,10}, 

Ga  =  {000,001,011,010, 110,  111,  101, 100}.  (2.2) 

The  particular  Gray  code  generated  by  the  above  algorithm  is  called  the  binary  reflected  Gray  code. 

Except  for  the  Connection  Machine  which  is  SIMD  and  bit  -  serial,  the  currently  existing 
hypercubes  function  in  MIMD  message  passing  mode:  a  processor  executes  its  own  program  which 
runs  independently  from  the  programs  of  other  nodes,  except  that  it  will  sometimes  require  data 
that  is  provided  by  processes  running  on  other  nodes.  Thus,  synchronization  is  achieved  only 
through  availability  of  data:  when  an  operand  is  needed  the  processor  waits  until  it  is  available 
before  pursuing  the  execution  of  its  program. 

In  order  to  estimate  the  total  time  that  is  required  to  execute  a  given  algorithm,  we  will  use 
a  simple  model  for  timing  both  arithmetic  and  interprocessor  communication.  For  arithmetic  we 
will  assume  that  it  takes  the  time  w  to  execute  a  pair  of  operations  consisting  of  an  addition  and 
a  multiplication.  Performing  m  such  pairs  will  require  the  time  mu;. 

For  communication,  we  assume  that  it  lakes  the  time 


Id+mr,  (2.3) 

to  transfer  m  words  from  one  processor  to  any  of  its  n  neighbors,  where  0  is  the  communication 
start-up  in  seconds  and  r  is  the  elemental  transfer  time,  in  seconds  per  word.  Moreover,  commu¬ 
nication  in  all  the  n  channels  can  take  place  simultaneously:  a  node  can  do  either  a  receive  from 
or  a  send  to  any  neighbor  while  doing  a  receive  or  a  send  with  another  neighbor.  This  assumption 
is  important  because  it  implies  that  the  total  communication  speed  from  a  node  is  proportional  to 
its  fanout  and  therefore  that  the  hardware  can  be  effectively  utilized. 

In  some  of  the  Gaussian  elimination  algorithms  we  need  to  move  the  same  data,  i.e.,  a  row  or 
a  column,  from  a  given  node  to  all  other  nodes.  This  is  called  a  broadcast  operation.  standard 
algorithm  to  broadcast  a  data  packet  D  from  node  0”  to  all  other  nodes  in  an  ri-cube  is  the  following. 
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ALGORITHM:  Hypercube  Broadcast 

In  every  node  L  do: 

For  j  =  1,2,3,. .  .n  do: 

If  L<2^~^  move  D  to  node  L  + 


This  basic  algorithm  is  not  optimal.  If  D  consists  of  N  words  it  will  take  an  approximate  time 
of 

n  [iVT  ■¥  0\  =  {Logik)  [iVr  +  0\  (2.4) 

to  execute  the  above  algorithm.  To  improve  efficiency,  one  can  partition  the  data  set  D  into  i/ 
equal  packets  and  pipeline  the  data  packets  in  succession.  There  is  an  optimal  number  of  packets 
and  the  corresponding  optimal  time  to  broadcast  N  words  in  an  n-cube  is  [17]  : 


(2.5) 


However,  in  this  paper  we  will  use  the  simple  upper  bound  given  by  the  following  inequality 


topt{N,  n)  <  2(iVr  +  n0)  =  2{Nt  +  0Logik). 

This  upper  bound  is  within  a  factor  of  about  2  of  the  actual  best  time  (2.5). 


(2.6) 


3.  Row  and  column  oriented  algorithms 

[•  3.1.  Broadcast  Row  Algorithm 

The  simplest  way  to  implement  Gaussian  elimination  in  parallel  is  to  subdivide  the  matrix  A 
into  k  blocks  of  y  rows  each  and  assign  one  block  to  each  processor  of  the  ring  successively.  If 
Po.Pi, . .  .Pfc-i  are  the  nodes  of  the  hypercube,  in  a  natural  order,  then  let  processor  P,  hold  rows 
iy  +  1  to  (»  +  1)^  of  A  and  the  corresponding  components  of  the  right  hand  side  vector  6,  as 
indicated  in  Figure  2. 


b 


Processor  Po  (Idle) 
Processor  Pi  (Idle) 
Processor  Pj  (Active) 
Processor  P3  (Active) 


Processor  Pk-\  (Active) 


Figure  2:  Gaussian  Elimination  on  a  Block  Row  Partitioned 


At  step  j,  row  j  must  be  sent  from  the  processor  where  it  is  located,  say  processor  P,.  to 
processors  P,+i  ...Pjt-i  order  to  perform  the  eliminations  in  each  of  them.  From  Section  2  we 
know  that  transferring  a  row  of  length  A*  —  7  +  1  to  all  processors  requires  a  time  bounded  from 
above  by 

2{dLog2k-^-  (N  -  J  +  1)t).  (3.1) 

Here  we  do  not  have  to  move  the  row  to  all  processors  but  only  the  processors  P,+] .....  Pt-j .  i.e.,  to 
k  -  i  -  I  processors.  Since  the  data  is  initially  arranged  so  that  block  i  is  in  Processor  P,-i,  then  it 
is  easy  to  see  that  the  processors  P,.  P,+] ,  Pi+2,  ■  ■  .Pk-i  are  in  a  subcube  of  dimension  \Log2{k  -  i)] 
and  therefore  we  can  perform  a  limited  broadcast  in  \Log2{k  —  f)]  steps.  The  communication  time 
for  each  step  is  then  bounded  from  above  by 

2(\Log2{k  -  +  (A'  -  j)t)  <  2({Log2(k  -  i)  +  l)l3  +  (A'  -  j)t). 


Summing  up  for  j  =  I,. .  .  N: 

V  jV 

2j  Yi^Log2{k  -  i)]l3  +  2^(N-  j)r  <  2— (A:  +  log2(k!))/9  +  2  ^ (A'  -  j)r. 

^  1=0  y=i  y=i 

From  Stirling’s  formula  we  get  the  approximate  communication  time 

k  2k 

tc  ^  A''(H-  Log2~)0  +  N'^t  =  N(Log2 — )0  +  N^r. 
e  e 

To  determine  the  time  for  arithmetic,  we  note  that  at  step  j  a  processor  performs  at  most  j 
eliminations  which  require  the  time 

^(A^->  +  l)u;.  (3.2) 

Summing  up  over  N  -  1  steps,  we  obtain  the  estimated  arithmetic  time 

ar3 

tA  ^  (3-3) 

Hence  the  total  estimated  time  for  this  algorithm  is 

Tn  ^  — pw  +  2\Logi—0  +  N'^t.  (3.4) 

2k  e 

Notice  the  deterioration  in  efficiency  in  the  arithmetic  time  due  to  the  denominator  2k  instead 
of  the  ideal  3A-  in  the  term  of  (3.4).  The  reason  for  this  loss  of  efficieuc>  is  that  at  the  end  of  the 

elimination  process  many  processors,  in  fact  most  of  them  at  the  very  end.  are  idle.  A  remedy  is 

to  interleave  the  rows  with  respect  to  processors,  an  idea  used  in  [9].  The  corresponding  mapping 
is  illustrated  in  Figure  3. 

To  estimate  the  communication  time  for  this  new  version  of  the  algorithm,  we  only  observe 
that  the  difference  with  the  previous  algorithm  is  that  a  row  must  now  be  sent  to  all  processors  at 
every  step  (except  the  very  last  k  steps  but  the  corresponding  difference  is  negligible.)  Thus  com¬ 
munication  contributes  to  each  step  of  the  algorithm  a  time  bounded  from  above  by  the  expression 
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Processor  Pq 
Processor  Pi 
Processor  P2 
Processor  P3 
Processor  Pq 
Processor  Pi 
Processor  P2 
Processor  P3 


Figure  3:  Gaussian  Elimination  with  Interleaved  Rows  for  k  =  4. 

(3.1),  whose  total  over  the  -  1  steps  of  Gaussian  elimination  is  about  N'^t  +  2N Log2{k)/3.  Con¬ 
cerning  arithmetic,  observe  that  at  step  j,  a  processor  typically  performs  the  linear  combination  of 
row  j  with  [(A'  -  j)/k]  rows.  Therefore,  each  step  will  now  cost 

+ (3.5) 

instead  of  (3.2).  Defining  the  function 

N-i  . 

/(A^,fc)=  j;[il(,-H),  (3.6) 

1=1 


we  get  the  estimated  time 


Tr.b  =  /(A'.t)w'  +  N^t  -I-  2N  (Log2k)0. 

Note  that  if  tue  integer  division  of  A'^  —  1  by  Ir  is  A^  —  1  =  /:s  -f-  r  then  we  have 

f(N,k)  =  sk 


(3.7) 


(s  -t-  l)(/r  -I-  3)  k(s'^  -  1) 


+  ^(2.V-r+l). 


When  k  <<  N.  then  f{N,k)  can  be  approximated  by  N^/(3k)  which  yields 


r3 


Tr.b  ^  +  2A'  (Log2k)0  for  k  <<  N. 

oK 


(3.8) 


Notice  that  a  loss  of  efficiency  takes  place  when  k  %  .V.  In  fact,  for  k  =  .V  we  obtain 
that  f(.\.k)  =  5(.V  —  1)(A'  -f-  2).  which  means  that  the  algorithm  has  an  arithmetic  efficiency  of 


approximately  2/3.  By  arithmetic  efficiency  we  mean  the  efficiency  that  would  be  obtained  in  an 
ideal  situation  where  communication  times  were  negligible. 

We  should  also  point  out  that  the  above  scheme  can  be  generalized  into  a  column  oriented 
scheme  in  a  straighforward  manner.  One  of  the  reasons  why  column  schemes  have  often  been 
preferred  to  row  oriented  schemes  is  that  row  interchange  is  easy  to  implement  with  them.  However, 
they  have  the  disadvantage  of  leading  to  more  inefficient  triangular  systems  solutions  [9].  Moreover, 
note  that  for  row  schemes  column  interchange  can  be  used  instead  of  row  interchange. 

3.2.  Pipelined  Ring  Algorithm 

In  the  pipelined  ring  algorithm,  the  computation  is  rearranged  so  that  only  nearest  neighbor 
communication  is  required.  We  consider  the  general  case  where  the  number  of  processors  k  is  larger 
than  A',  in  fact  ideally  much  larger,  and  assume  the  interleaved  distribution  illustrated  in  Figure  3. 
Here  we  denote  by  Po,Pi, . .  .Pt_i  the  sequence  of  processors  that  form  a  ring  embedded  in  the 
hypercube  using  the  Gray  code  embedding  of  section  2.  The  idea  of  the  algorithm  is  that  every 
processor  executes  the  iV  —  1  successive  elimination  steps  of  Gaussian  elimination  on  the  rows  that 
it  holds.  To  do  so  it  must  receive  and  store  the  pivot  row,  send  it  immediatly  to  the  next  processor 
in  the  ring,  and  then  proceed  with  the  elimination.  Schematically  the  algorithm  is  as  follows. 

For  j  =  1, ...,  N  —  1,  do  in  every  processor 

1.  Receive  the  j‘^  row  from  processor  P,_i  and  store  it. 

2.  Send  the  j‘^  row  just  received  to  the  processor  P,+i 

3.  Perform  the  elimination  step  number  j  for  all  rows  m,  m>j  that  belong  to  processor  P,. 

It  might  seem  more  natural  at  first  to  interchange  2  and  3,  i.e.,  to  compute  as  soon  as  the  row 
is  available  and  send  the  row  to  the  next  processor  after  the  arithmetic  is  completed  but  this  is  not 
as  efficient  since  the  next  processor  will  be  idle  waiting  for  the  row  to  arrive  while  the  Processor  P, 
is  computing. 

Let  us  now  estimate  the  execution  time  of  this  a'gorithm.  To  this  end  we  look  at  the  block 
consisting  of  the  k  last  rows  of  the  system,  see  Figure  3.  Without  loss  of  generality  we  will  make  the 
additional  assumption  that  the  size  N  of  the  matrix  is  a  multiple  of  A:,  the  number  of  processors. 
The  key  observation  is  that  the  last  row  of  the  linear  system,  which  is  held  in  the  last  processor 
P*-i,  is  the  critical  row  in  the  sense  that  the  job  will  be  completed  when  the  last  step  of  Gaussian 
elimination  will  be  performed  on  this  last  row. 

Row  1  will  reach  the  last  processor  Pt-i  after  crossing  k  —  1  processor,  i.e.,  after  the  time 

(fc -!)[/?  + AV]  (3.9) 

Thereafter,  processor  P*_i  will  use  it  for  the  elimination  and  then  receive  the  next  row  which  should 
be  ready  to  be  moved  from  processor  Pk-2-  Thus,  each  step  j  requires  the  linear  combination  of 
the  row  j  with  [(A'  -  j)/k]  rows,  namely  all  those  rows  among  rows  j  +  1,  j  +  2, . . . ,  A'  that  are 
held  in  processor  P/t-j.  This  is  followed  by  a  communication  task  to  receive  the  next  row  which  is 
of  length  N  -  j.  Therefore,  in  Processor  P*_i  each  step  j  of  Gaussian  elimination  will  require  the 
time  , 

[^-^1  (N-J  +  l)a;  +  (A’  -  +  0. 

Summing  up  these  times  over  j  =  ^....A"^  —  1  and  adding  the  result  to  the  delay  time  (3.9),  we 
get  by  use  of  the  definition  (3.G)  and  by  dropping  lower  order  terms: 

Tr.R  =  /(-V.  +  ^-V(A'  +  2k)r  +  {N  +  k)0  (3.10) 
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(3.10) 


For  k  <<  N  we  can  again  approximate  /  as  in  section  3.1  to  obtain  the  approximate  total 


1 

+ -N{N  +  2k)T  +  {N  +  k)0  for  k  <<  N 

oK  2» 


(3.11) 


Notice  that  the  arithmetic  time  is  the  same  as  that  of  the  broadcast  algorithm.  Incidentally, 
this  means  that  when  communication  is  very  fast  with  respect  to  arithmetic,  then  the  two  algorithm 
will  show  the  same  performance.  For  the  particular  case  where  k  =  N ,  i.e.,  in  the  case  of  one  row 
per  processor,  we  have  f{N\k)  =  ^(A'’  -  l)(iV  +  2),  and  therefore  an  arithmetic  efficiency  of  2/3  is 
achieved  just  as  for  the  broadcast  algorithm. 

.A.n  important  difference  between  the  above  estimate  and  that  of  the  broadcast  algorithms  is 
that  the  contribution  of  communication  start-ups,  i.e.,  the  0  term,  is  now  linear  in  k  and  N  while 
for  the  broadcast  algorithm  this  contribution  was  of  the  form  O(0N Log2k).  If  0  is  large  compared 
with  the  coefficients  r  and  w,  as  is  the  case  with  Intel’s  iPSC  for  example,  then  this  term  may 
dominate  the  total  time  when  k  =  0{N).  This  is  widely  verified  by  our  numerical  experiments. 


4.  Grid  algorithms 

In  this  section  we  assume  that  the  number  of  processors,  denoted  by  k\  is  of  the  form  A-  =  2", 
where  n.  the  dimension  of  the  cube,  is  even,  and  we  define  k  =  \/k  =  2"/^.  We  consider  the 
assignment  of  the  linear  system  into  the  nodes  of  the  hypercube  that  is  the  result  of  mapping  the 
system  into  a  \/k  x  \/k  square  grid  of  processors  which  is  embedded  into  the  hypercube.  This  is 
illustrated  in  Figure  4,  where  a  label  (t,/)  in  a  block  means  that  this  block  is  mapped  into  the 
processor  which  in  the  row  i  and  column  j  of  the  processor  grid. 

Similarly  to  row  oriented  algorithms  we  wdll  consider  broadcast  and  pipelined  algorithms. 

4.1.  The  broadcast  grid  algorithm 

In  the  broadcast  grid  algorithm,  we  must  now  broadcast  in  two  directions;  we  must  send  the 
elimination  sub-rows  to  every  processor  in  the  same  vertical  line  of  processors  of  the  grid,  and  the 


sub-column  of  multipliers  horizontally.  Moreover,  the  multipliers  must  be  computed  before  being 
moved.  In  other  words  the  j-th  step  could  be  described  simply  as  follows: 

1.  Broadcast  the  element  ajj  vertically. 

2.  Compute  the  multipliers  dij/ajj  in  the  processors  that  hold  the  elements  a,j. 

3.  Broadcast  the  sub-columns  of  multipliers  horizontally  and  the  elimination  sub-rows  vertically. 

4.  Proceed  with  the  elimination  process  in  each  processor. 

Remember  that  each  column  or  row  of  processors  in  the  embedded  processor  grid  is  a  subcube 
of  dimension  n/2.  Therefore  step  1,  which  is  a  broadcast  of  a  single  element,  takes  the  time 
(Log2K)(0+T).  Step  2  consists  of  at  most  f(iV— jI/k]  arithmetic  operations  and  costs  [(A’^-j)/k]w. 

Step  3  is  a  broadcast  of  at  most  [(A’’  —  j)/k]  words  in  each  direction.  The  two  data  transfers 
can  be  overlapped  in  the  two  horizontal  and  vertical  subcubes  of  dimension  n/2  and  therefore  the 
time  required  for  the  broadcast  in  step  3  is  bounded  from  above  by 

2{\{N  -  j)/K]T  +  0Log2K) 

Note  that  as  usual  we  take  the  largest  time  as  given  by  the  ceilings  f.]  since  these  reflect  the  times 
taken  by  the  processors  that  finish  last. 

Finally,  step  4  consists  of  linear  combinations  of  [(^V  -  j)/K]  rows  of  length  [(A’’  —  j)/K]  each 
and  hence  the  time  for  this  step  is 

f(A^-i)/KlV 

Adding  the  above  times  arid  summing  them  over  the  steps  j  =  1 . A'  -  1  we  find  the  following 

the  total  time  valid  when  N  is  divisible  by  k 

Tb,g  =  1)(9  +  2)u;  -t-  +  2^^0Log2k,  (4.1) 


where  here 


N 

5  =  — . 


Observe  that  in  the  limiting  case  where  k  =  the  time  to  solve  a  linear  system  by  this 
algorithm  is  dc.  .,uated  by  the  communication  start-up  time  ^dN Log2N .  Therefore,  this  is  not  an 
effective  method  for  this  case  since  there  are  algorithms  that  realize  Gaussian  elimination  in  O(A') 
time  using  0(A'^)  processors.  The  pipelined  method  described  next  ^  one  such  algorithm. 

4.2.  The  pipelined  grid  algorithm 

We  now  consider  a  pipelined  Gaussian  elimination  algorithm  for  the  grid  mapping  of  Figure  4. 
It  is  assumed  here  that  no  pivoting  is  used.  The  generalization  of  the  ring  pipelined  algorithm 
can  be  easily  understood  if,  conceptually,  we  view  each  row  of  processors  in  the  processor  grid, 
as  one  processor  which  is  part  of  a  ring  of  k  processors.  To  implement  the  pipelined  ring  algo¬ 
rithm  we  observe  that  the  multipliers  must  now  be  also  moved  from  left  to  right  in  each  row  of 
processors.  Thus  we  must  pipeline  the  information  in  both  horizontal  and  vertical  directions.  The 
implementation  is  a  straightforward  generalization  of  the  pipelined  ring  algorithm.  A  processor 
typically  receives  a  pivot  row,  then  forwards  it  to  the  next  processor  down  and  a  similar  operation 
is  done  for  the  column  of  multipliers.  As  soon  as  both  data  have  arrived  the  processor  proceeds 
with  the  arithmetic.  An  important  detail  is  that  at  the  fth  step,  the  multipliers  are  computed  in 
the  processors  that  contain  the  elements  a,y. 
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Looking  at  the  first  block-row  of  .4,  contained  in  processors  Pi  ,,  the  rightmost  column  of  A. 
will  receive  the  N/k  multipliers  with  a  delay  of 

N  N 

(k  —  +  0)  A  — uJ 

K.  K 

i.e..  the  time  to  travel  across  k  —  1  processor  and  to  form  the  multipliers  in  the  leftmost  processors. 
This  assumes  that  each  processor  receives  the  column  then  sends  it  immediatly  to  its  right  neighbor 
before  proceeding  with  the  arithmetic. 

Looking  at  the  rightmost  column  of  processors,  we  use  again  the  fact  that,  as  for  the  ring 
algorithm,  the  last  row  is  critical,  in  the  sense  that  it  determines  the  total  time  of  the  algorithm. 
Every  new  step  will  consist  of  an  arithmetic  task  whereby  f (A'  —  j)/k]  sub-rows  of  size  [ {N  -  j)/k] 
each  are  combined  with  the  pivot  sub-row,  and  then  sending  the  previously  received  pivot  sub-row 
downward.  Thus  every  new  step  j  in  the  bottom  rightmost  proceesor  will  require  the  time 


+  0 


K  '  K 

Summing  these  times  over  the  steps  j  =  1,2, ...  A'  —  1  and  adding  the  above  starting  delay  we  get 
the  approximate  time 


Tpo  =  +  1)^^  +  2^(9  +  5)r  +  (A’  -f  2k)/?, 


(4.3) 


where  again  q  is  defined  by  (4.2).  Observe  that  as  noted  earlier  a  time  of  0(N)  can  be  achieved. 


5.  Numerical  experiments 

.\s  an  experiment  we  tested  the  broadcast  row  algorithm  and  the  pipelined  ring  algorithm  on 
Intel's  iPSC-d7,  which  is  a  hypercube  of  dimension  seven.  The  linear  system  was  taken  so  that  its 
solution  is  known.  More  precisely  we  chose  a  system  of  size  size  N  =  127,  and  the  matrix 


a,j  =  1  for  j  /  t,  and  o,.,  =  A’  -f-  1. 

All  components  of  the  right  hand  side  6  are  taken  equal  to  2N.  so  that  the  exact  solution  is  known 
to  be  J  ==  (1, 1. 1 . . .  1)^. 

We  implemented  a  column  oriented  algorithm  rather  than  a  row  oriented  one.  The  linear 
system  is  therefore  assigned  by  columns  using  interleaving,  the  right  hand  side  being  considered  as 
the  last  column  of  the  matrix.  For  the  pipelined  algorithm  we  used  the  Gray  code  to  embed  a  ring 
Pq.Pi . Pk-i-  This  was  not  d-ne  for  the  broadcast  algorithm  for  which  it  is  not  needed. 

We  solved  the  linear  system  with  the  two  methods  varying  the  number  of  processors  from 
/,  =  2  to  k  =  128  in  each  case.  This  can  be  easily  done  by  varying  the  dimension  of  the  iPSC.  Thus 
the  results  shown  are  only  for  numbers  of  processors  that  are  powers  of  2.  The  comparison  of  the 
performances  of  two  methods  shown  in  Figure  5  confirms  the  prediction  that  pipelined  methods 
are  generally  faster. 

To  check  the  accuracy  of  the  models  of  performance  analysis  proposed  in  this  paper  and  in 
other  papers  [9.  12,  18,  17].  we  compared  the  real  times  and  the  estimated  times  provided  by  our 
models.  For  the  broadcast  algorithm  we  used  the  non-optimal  formula  (2.4)  instead  the  upper 
bounds  (2.6),  as  it  reflects  the  present  state  of  the  iPSC.  The  iPSC  uses  packets  of  length  of  IKB: 
hence  in  our  case  the  number  of  packets  is  always  one.  The  communication  start-up  time  was 
measured  to  be  approximately  G.5  nif<.  The  peak  bandwidth  of  each  channel  is  lOMbits/sec  which 
ei\es  T  —  3.2// .s.  However,  various  measures  conducted  elsewheri’  lead  to  the  the  actual  measured 


time  T  ^  &IJS.  corresponding  to  less  than  half  the  optimal  speed.  As  for  the  parameter  we 
used  ft  9.25  x  10“"^.  This  was  obtained  by  using  two  processors  and  evaluating  the  number  of 
operations  performed  as  well  as  estimating  the  communication  time  w^hich  is  then  deducted.  Thus 
this  time  for  one  pair  of  multiplication  and  one  addition  includes  such  times  as  memory  fetches, 
loop  up-dating  and  so  on.  The  plot  in  Figure  6  shows  the  result  for  the  broadcast  algorithm  while 
that  of  Figure  7  shows  the  result  for  the  pipelined  algorithm.  The  dashed  lines  in  both  plots  are 
the  estimates.  Both  are  relatively  accurate  given  the  simplifying  assumptions  made  to  derive  the 
models.  Moreover,  our  experience  is  that  in  the  pipelined  algorithm  the  communication  start-up 
varies  substantially.  For  example,  it  seems  that  when  a  receiv'e  command  is  issued  while  a  send 
command  is  still  being  executed,  then  0  is  increased,  possibly  doubled.  In  fact,  as  is  shown  in 
Figure  8.  a  much  better  agreement  is  obtained  for  /?  =  10  ms.  which  roughly  assumes  that  the 
above  conflict  occurs  half  of  the  time. 

6.  Discussion 

In  what  follows  we  make  a  few  concluding  remarks  and  examine  briefly  some  of  the  issues  such 
as  pivoting,  that  have  not  been  addressed  in  this  paper. 

1.  There  are  no  particular  reasons  for  using  a  broadcast  algorithm.  Plots  of  the  theoretical  times 
for  various  values  of  the  parameters  ^,cj,  t,  show  the  pipelined  algorithm  to  be  always  either 
faster  than,  or  very  close  to  (for  small  fc),  that  of  the  broadcast  algorithm. 

2.  Similarly,  grid  algorithms  seem  to  be  generally  superior  to  row  or  column  oriented  algorithms. 
Ring  algorithms  cannot  be  used  for  k>N  but  this  is  only  part  of  the  story.  Not  only  do 
pi{)elined  grid  algorithms  have  smaller  communication  times  but  also  when  k  =  0(.V)  the 
arithemtic  efficiency  of  the  row  algorithms  deteriorates  to  2/3.  On  the  other  hand  since  in  this 
situation  \/k  <<  .V,  an  arithmetic  efficiency  of  nearly  one  is  achievable  for  the  grid  algorithms. 

3.  Partial  pivoting  is  easy  to  implement  for  column  or  row  mappings.  For  grid  mappings,  partial 
pivoting  is  no  longer  attractive  and  should  be  replaced  by  pairwise  pivoting  which  is  straight¬ 
forward  to  implement  [4]. 

4.  .\n  interesting  conclusion  is  that  the  hypercube  topology  does  not  seem  to  do  any  better  than 
the  grid  topology  for  Gaussian  elimination.  Hypercubes  would  be  superior  if  partial  or  total 
pivoting  were  crucial  for  stability  but  the  recent  paper  by  D.  Sorensen  [22]  seems  to  indicate 
that  pairwise  pivoting  will  always  be  good  enough  in  practice. 

5.  \’arious  methods  for  solving  the  resulting  triangular  system  have  been  devised  for  various 
mappings,  except  grids  in  [9j.  For  reason  of  space  we  defer  the  study  of  these  to  another 
report.  We  should  only  point  out  that  there  are  efficient  ring  algorithms  for  k  <<  .V  and  grid 
algorithms  for  \/k  <<  .V.  However  when  \/k  =  0(N)  then  it  may  be  a  better  idea  to  switch 
to  the  Gauss-Jordan  method  which  avoids  triangular  systems  solutions. 
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Figure  5:  Comparison  of  the  broadcast  row  algorithm  (B) 
and  the  pipelined  ring  algorithm  (P). 
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